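This report explores a dataset of red wines described by 11 input variables and a quality rating.
The raw dataset has 13 variables (a row index X, the 11 input variables and quality) for 1599 observations. The structure below lists 15 because it also includes two derived factors, quality.factor and quality.cat.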
## 'data.frame': 1599 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.factor : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## $ quality.cat : Factor w/ 3 levels "bad","Average",..: 2 2 2 2 2 2 2 3 3 2 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality.factor quality.cat
## Min. : 8.40 Min. :3.000 3: 10 bad : 63
## 1st Qu.: 9.50 1st Qu.:5.000 4: 53 Average:1319
## Median :10.20 Median :6.000 5:681 good : 217
## Mean :10.42 Mean :5.636 6:638
## 3rd Qu.:11.10 3rd Qu.:6.000 7:199
## Max. :14.90 Max. :8.000 8: 18
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Most wines were rated 5 or 6, with a mean of 5.6 and a median of 6. No wine was rated 1 or 2, and none was given 9 or 10. Only a small share is rated 8 (18 wines) and an even smaller share is rated 3 (10 wines). It would be interesting to compare these two extremes (3 vs 8) to see if there is a stark difference in specific attributes.
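To put numbers on "most" and "a very small percentage", the counts from the frequency table can be turned into shares. This is a minimal sketch that re-enters the six counts by hand rather than reading them from the data frame.

```r
# Counts per quality rating, copied from the frequency table above
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of each rating as a percentage of all wines
pct <- 100 * counts / sum(counts)
print(round(pct, 1))

# Combined share of the two middle ratings (5 and 6)
mid_share <- sum(pct[c("5", "6")])
print(round(mid_share, 1))
```

Ratings 5 and 6 together cover roughly four out of five wines, while each of the extreme ratings (3 and 8) accounts for only about 1% of the data.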
For now, let’s look at other variables to see what we can find.
## # A tibble: 6 × 6
## quality mean_alc median_alc min_alc max_alc n
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 3 9.955000 9.925 8.4 11.0 10
## 2 4 10.265094 10.000 9.0 13.1 53
## 3 5 9.899706 9.700 8.5 14.9 681
## 4 6 10.629519 10.500 8.4 14.0 638
## 5 7 11.465913 11.500 9.2 14.0 199
## 6 8 12.094444 12.150 9.8 14.0 18
Mean alcohol content is much higher in highly rated wines (12.1 for rating 8) than in low-rated wines (roughly 9.9 to 10.3 for ratings 3 to 5).
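A per-rating summary like the one above can be built by splitting the data on quality and summarising each group. The tibble header suggests the report used dplyr; the sketch below uses base R instead and runs on a toy data frame (the first ten quality and alcohol values from the structure listing), so its numbers will not match the full table.

```r
# Toy data: the first ten (quality, alcohol) pairs shown in the structure
# listing. The real analysis uses all 1599 rows of rw.
toy <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# One row per quality level: mean, median, min, max and group size
by_quality <- do.call(rbind, lapply(split(toy, toy$quality), function(g) {
  data.frame(
    quality    = g$quality[1],
    mean_alc   = mean(g$alcohol),
    median_alc = median(g$alcohol),
    min_alc    = min(g$alcohol),
    max_alc    = max(g$alcohol),
    n          = nrow(g)
  )
}))
print(by_quality)
```

Keeping the group size n next to the means matters here: ratings 3 and 8 have only 10 and 18 wines, so their averages are much noisier than those for ratings 5 and 6.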
The histogram peaks at an alcohol content between 9.5 and 10, which is interesting because, based on common knowledge, most wines have an alcohol content of around 13.5% alc/vol. Most wines in this dataset are therefore relatively low in alcohol (median 10.2), which might explain why most have not been rated very high.
Fixed acidity looks roughly normally distributed, with a slight right skew (mean 8.32 vs median 7.90).
Let’s look at volatile acidity next since a high concentration of that will result in a vinegary taste in wines and in theory affects quality.
## # A tibble: 6 × 6
## quality mean_acid median_acid min_acid max_acid n
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 3 0.8845000 0.845 0.44 1.580 10
## 2 4 0.6939623 0.670 0.23 1.130 53
## 3 5 0.5770411 0.580 0.18 1.330 681
## 4 6 0.4974843 0.490 0.16 1.040 638
## 5 7 0.4039196 0.370 0.12 0.915 199
## 6 8 0.4233333 0.370 0.26 0.850 18
Mean and median volatile acidity generally decrease as quality ratings increase, although rating 8 shows a slightly higher mean (0.42) than rating 7 (0.40) and the same median (0.37).
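One rough way to summarise the direction of these trends is a rank correlation between the rating and the per-rating medians. The sketch below re-enters the medians from the two summary tables by hand. With only six groups it is a descriptive check, not a significance test.

```r
# Ratings and per-rating medians, copied from the two summary tables above
quality     <- 3:8
median_alc  <- c(9.925, 10.000, 9.700, 10.500, 11.500, 12.150)
median_acid <- c(0.845, 0.670, 0.580, 0.490, 0.370, 0.370)

# Spearman's rho only uses the ordering of the values, so it measures
# how consistently the medians rise or fall with the rating
rho_alc  <- cor(quality, median_alc,  method = "spearman")
rho_acid <- cor(quality, median_acid, method = "spearman")
print(c(alcohol = rho_alc, volatile.acidity = rho_acid))
```

The sign of each value gives the direction (positive for alcohol, negative for volatile acidity). The alcohol value falls short of 1 because rating 5 has a lower median than ratings 3 and 4.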
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
We do see some wines with high volatile acidity (> 1 g/l, and a few above 1.5 g/l), but for the most part values lie around 0.5 g/l.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Many wines have fairly low sugar content, between 1.5 and 2.5 g/l, with 2 g/l being the most common. We don’t have any wines considered “very sweet” (> 45 g/l), since sugar content only ranges from 0.9 to 15.5 g/l. The distribution is right-skewed (mean 2.54 vs median 2.2), so let’s apply a log transform to bring it closer to normal.
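The effect of a log transform on a right-skewed variable can be checked with a simple skewness measure. The sketch below uses only the first ten residual.sugar values from the structure listing, and the `skewness` helper is defined here for illustration (it is not from a package). The report does not say which log base was used; skewness does not depend on the base, so `log10` is an arbitrary choice.

```r
# First ten residual.sugar values from the structure listing (illustration only)
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Moment-based sample skewness: 0 for a symmetric sample,
# positive when the right tail is longer
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skew_raw <- skewness(sugar)
skew_log <- skewness(log10(sugar))
print(c(raw = skew_raw, log10 = skew_log))
```

On this small sample the transform reduces the skew but does not remove it, which is a reminder that a log transform makes the data closer to normal rather than guaranteeing normality.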
We also notice a small batch of wines with sugar content > 8 g/l, but they are very few in number. Let’s zoom in to take a closer look.
In this zoomed view, the 2 wines rated good have comparatively low sugar content and the one rated bad has high sugar. However, wines rated average have sugar content across the whole range, so this variable is a definite candidate for bivariate analysis. Next, let’s look at a few other variables.
Most don’t differ too much in density. Since density is affected by alcohol and sugar content, we’ll table this for now.
We see almost a bimodal distribution.
Most wines contain only small, varying amounts of citric acid, and many have little to none. More than 110 wines have no citric acid, and these have been rated average.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Most wines appear to have low chloride levels, mostly between about 0.01 and 0.2. Let’s zoom in on these, ignoring the tail for now.
It is quite apparent that most wines have chloride values between 0.05 and 0.1. Now let’s look at the tail.
We see a handful of wines with a high chloride (i.e. salt) content. Most of these are rated average and just a couple are rated good. So although high chloride content doesn’t seem to contribute to good wines, it doesn’t always lead to bad wines either.
Let’s look at free sulphur dioxide since SO2 affects oxidation which in turn is known to affect quality of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
We do have wines with > 50 mg/l of free SO2. The dataset description explains that at concentrations above that level, SO2 affects the taste and smell of wine. Common knowledge does indicate that SO2 has a foul odor, so let’s investigate this.
We have about 6 wines with a high free SO2 content. Surprisingly, none of them were rated bad!
Let’s then take a look at total SO2 and apply a log transformation to get a distribution closer to normal.
A small number of wines have total SO2 content as high as 289 mg/l (the maximum). In the bulk of the distribution, the histogram peaks at around 30 mg/l, with close to 300 wines at that level.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Looking at pH and sulfates
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The bulk of the wines have a pH between 3.2 and 3.5, which agrees with the known fact that most wines are fairly acidic, with a range of 3 to 4 on the pH scale. We do see one wine close to 3.9 and another as low as 2.7. Both were rated “bad”.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates contribute to SO2 gas levels, and SO2 prevents oxidation and also acts as an antimicrobial, according to the given information. The concentration varies from 0.3 to 1.4 for most wines. We do see high concentrations of around 2 in some wines.
There are 1599 red wines that have been rated on a scale from 1 to 10, 10 being the best and 1 being the worst. 11 different input variables (e.g. pH, alcohol content) have been measured for each wine, along with the output variable (quality). All the input variables are numeric, while the output variable is an integer.
Other observations: 1. No wines have been given either a very poor rating (1) or a very excellent rating (9 or 10). 2. The average rating of the wines is 5.6 with a median of 6. The lowest rating is 3 and the highest is 8.
Based on common knowledge, factors such as sugar, alcohol and acidity affect taste and quality. It would also be interesting to plot a scatter plot matrix to understand the effect of the other variables on quality, as well as the relationships between input variables.
I think sulphates, pH and sulphur dioxide also affect quality.
Yes. Since quality is an integer variable, I converted it to a factor named quality.factor for ease of analysis. I also created a three-level factor, quality.cat (bad, Average, good):
# Factor version of the integer rating, one level per observed score (3 to 8)
rw$quality.factor <- factor(rw$quality)
# Three-level category: bad (<= 4), Average (5 or 6), good (>= 7)
rw$quality.cat <- ifelse(rw$quality >= 7, 'good',
                         ifelse(rw$quality <= 4, 'bad', 'Average'))
rw$quality.cat <- factor(rw$quality.cat, levels = c("bad", "Average", "good"))
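The same three-way binning can also be written with `cut()`, and the result can be checked against the counts in the summary output (63 bad, 1319 Average, 217 good). This is an alternative sketch, not the code used in the report. It rebuilds a quality vector from the frequency table so that it runs without the data file.

```r
# Rebuild a quality vector from the frequency table shown earlier
counts  <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# cut() uses right-closed intervals by default:
# (-Inf, 4] -> bad, (4, 6] -> Average, (6, Inf) -> good
quality_cat <- cut(quality,
                   breaks = c(-Inf, 4, 6, Inf),
                   labels = c("bad", "Average", "good"))

cat_counts <- table(quality_cat)
print(cat_counts)
```

Because `cut()` returns a factor with the levels in break order, no separate `factor(..., levels = ...)` call is needed to get bad, Average, good in that order.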
Residual sugar and total sulfur dioxide were not normally distributed, so I transformed them using the log function to bring them closer to normal.
Drawing a sample of 1000 wines, let’s look at the ggpairs plots.
With respect to quality, there’s a moderate positive correlation with alcohol content (0.504) and a weak positive correlation with citric acid (0.223) and sulphates concentration (0.267).
There is a weak negative correlation of quality with volatile acidity (-0.377) and total sulphur dioxide (-0.196).
Surprisingly, pH is barely correlated with quality (-0.068), and neither are chlorides (-0.129).
Interestingly, there is a strong positive correlation between fixed acidity and density (0.658), as well as between citric acid and fixed acidity (0.665). The latter may just be a result of how acidity is measured, since citric acid would contribute to the total acid content. The former would be interesting to explore. Fixed acidity is also highly negatively correlated with pH, as shown.
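The coefficients quoted above are Pearson correlations, which is what `cor()` computes by default. As a minimal sketch of what that number means, the code below computes it from the definition and compares it with `cor()`. It uses only the first ten alcohol and quality values from the structure listing, so the result is illustrative and will not equal the 0.504 reported for the sample of 1000.

```r
# First ten alcohol and quality values from the structure listing (illustration only)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Pearson's r: sum of cross-deviations from the means, scaled by the
# square root of the product of the summed squared deviations
dx <- alcohol - mean(alcohol)
dy <- quality - mean(quality)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in equivalent
r_builtin <- cor(alcohol, quality)
print(c(manual = r_manual, builtin = r_builtin))
```

Pearson's r only captures linear association, so a low value such as the one for pH does not rule out a non-linear relationship with quality.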
Let us compare how the variables are distributed across quality levels using box plots.
We can see box plots of density and pH are almost mirror images of each other. Let’s further explore scatter plots of just these variables of interest.
Most wines rated bad are low in alcohol and most wines rated good are high in alcohol. But we also see wines with high alcohol content rated average, as shown by the outliers. Clearly, some other factors also contribute to quality.
With regard to total sulphur dioxide, most wines rated average have low levels, so this variable is unlikely to be a major factor affecting quality. Unless we get more data (i.e. more wines rated bad and good) and can analyze SO2 content there, we can table this.
Let’s further explore the relationships between variables. The negative correlation between pH and acidity is expected, given that pH is a measure of acidity (lower pH means more acidic).
As noted above, alcohol separates bad wines from good ones, yet some high-alcohol wines are rated only average. The case of citric acid was similar: its presence in fair amounts is good, but a lot of it doesn’t necessarily mean good wine, and its absence doesn’t imply bad quality either. Clearly, many factors play together in determining quality.
Based on common knowledge and the given information, I assumed a high concentration of free sulfur dioxide (> 50 mg/l) would lead to bad wines, but that was not the case when I plotted free SO2 vs quality.
The strongest relationship with quality was definitely that of alcohol content. For wines with high alcohol content, other factors also played into determining quality, but few wines rated good were low in alcohol (the lowest alcohol content among wines rated 7 or 8 is 9.2), which strengthened the result.
[Plot: citric acid vs alcohol by quality]
Looking at the 1st and 3rd plots, wines rated good generally have higher citric acid content, implying fresher wines, and some wines that were rated good despite low citric acid had high alcohol content.
##
## Calls:
## m1: lm(formula = I(rw$quality) ~ I(rw$alcohol), data = rw)
## m2: lm(formula = I(rw$quality) ~ I(rw$alcohol) + rw$citric.acid,
## data = rw)
## m3: lm(formula = I(rw$quality) ~ I(rw$alcohol) + rw$citric.acid +
## rw$sulphates, data = rw)
## m4: lm(formula = I(rw$quality) ~ I(rw$alcohol) + rw$citric.acid +
## rw$sulphates + rw$volatile.acidity, data = rw)
##
## ===================================================================
## m1 m2 m3 m4
## -------------------------------------------------------------------
## (Intercept) 1.875*** 1.830*** 1.434*** 2.646***
## (0.175) (0.171) (0.176) (0.201)
## I(rw$alcohol) 0.361*** 0.346*** 0.338*** 0.309***
## (0.017) (0.016) (0.016) (0.016)
## rw$citric.acid 0.730*** 0.513*** -0.079
## (0.090) (0.093) (0.104)
## rw$sulphates 0.814*** 0.696***
## (0.107) (0.103)
## rw$volatile.acidity -1.265***
## (0.113)
## -------------------------------------------------------------------
## R-squared 0.227 0.257 0.284 0.336
## adj. R-squared 0.226 0.256 0.282 0.334
## sigma 0.710 0.696 0.684 0.659
## F 468.267 276.595 210.501 201.777
## p 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1688.711 -1659.955 -1599.093
## Deviance 805.870 773.917 746.576 691.852
## AIC 3448.114 3385.421 3329.910 3210.186
## BIC 3464.245 3406.930 3356.795 3242.448
## N 1599 1599 1599 1599
## ===================================================================
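The AIC and BIC rows in the table above follow directly from the log-likelihood row, which makes for a quick consistency check. The sketch below assumes the usual definitions, AIC = -2 logLik + 2k and BIC = -2 logLik + k log(n), where k counts the regression coefficients (including the intercept) plus one for the residual variance. All numbers are copied by hand from the table.

```r
n <- 1599

# Log-likelihoods copied from the model comparison table
loglik <- c(m1 = -1721.057, m2 = -1688.711, m3 = -1659.955, m4 = -1599.093)

# Estimated parameters: coefficients including the intercept, plus 1 for sigma
k <- c(m1 = 3, m2 = 4, m3 = 5, m4 = 6)

aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + log(n) * k

# Values reported in the table, for comparison
reported_aic <- c(m1 = 3448.114, m2 = 3385.421, m3 = 3329.910, m4 = 3210.186)
reported_bic <- c(m1 = 3464.245, m2 = 3406.930, m3 = 3356.795, m4 = 3242.448)

print(round(cbind(aic, reported_aic, bic, reported_bic), 3))
```

Both criteria drop with every added variable, with the largest drop from m3 to m4, so the penalty for the extra parameter is outweighed by the improvement in fit each time.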
I think it was understanding the relationship between alcohol and citric acid, both of which contributed to good quality wines. Also, learning that although a good concentration of both is important for good quality, too much of either doesn’t necessarily imply that you would get a good wine.
The interaction between sulphates and alcohol content was interesting. Wines rated bad were mostly clustered around the bottom left, with low concentrations of sulphates, while wines on the right were mostly rated average and good, with most of the good wines clustered on the top right with high concentrations of both.
Yes, I created a linear regression model to predict the quality of wines based on selected independent variables in the dataset.
The model with all 4 variables (alcohol, citric acid, sulphates and volatile acidity) has the highest R-squared and explains about 34% of the variance in quality (R-squared 0.336). Note that once volatile acidity is added, the citric acid coefficient (-0.079, standard error 0.104) is no longer significant.
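R-squared is the share of the variance in the response that the fitted model accounts for, i.e. 1 - RSS/TSS. The sketch below shows this on synthetic data. The data frame `d`, its coefficients and the noise level are invented for illustration and are not the wine data. It also writes the formula with bare column names and `data = d`, which gives cleaner coefficient labels than `rw$...` terms.

```r
set.seed(1)

# Synthetic stand-in data (hypothetical values, not the wine dataset)
n <- 200
d <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)
d$quality <- 3 + 0.3 * d$alcohol - 1.2 * d$volatile.acidity + rnorm(n, sd = 0.7)

# Column names in the formula are looked up in d
fit <- lm(quality ~ alcohol + volatile.acidity, data = d)

# R-squared from its definition: 1 - residual SS / total SS
rss <- sum(residuals(fit)^2)
tss <- sum((d$quality - mean(d$quality))^2)
r2_manual  <- 1 - rss / tss
r2_builtin <- summary(fit)$r.squared

print(c(manual = r2_manual, builtin = r2_builtin))
print(names(coef(fit)))
```

Read this way, an R-squared of 0.336 means about two thirds of the variation in quality is left unexplained by the four predictors.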
There are some limitations of this model: the variables were selected using ggpairs, which was based on a sample of the dataset, so additional data might alter the results. Also, this dataset covers nearly 1600 wines with very few at either end of the quality spectrum, so adding more wines of very low quality (3) or very high quality (8 or above) is likely to affect the results.
We can see how wines rated bad (red) are around the bottom left and wines rated good (green) are clustered around the top right, with average wines clustered around the middle. To an extent, this illustrates the interaction between some of the variables that affect quality.
The box plots clearly depict how each of the factors varies with quality. We can see how the median moves higher with increasing quality for alcohol, citric acid and sulphates, and the reverse for variables such as volatile acidity and density.
Plotting citric acid against alcohol, we can see good wines (blue) clustered mostly at the top of the plot, average wines (green) in the middle and bad wines (red) clustered around the bottom. This suggests that citric acid results in fresh-tasting and therefore better quality wines. However, too much of it doesn’t necessarily imply good wines, as can be seen from a green point on the top right with both high citric acid and high alcohol content.
We see a lot of good wines centered in the middle (alcohol content between 10 and 13), but on the left, where alcohol content is low, we see mostly wines rated bad and average. On the right of the plot, where alcohol content is high, we mostly see good and average wines but no bad wines. We can conclude that alcohol content is a useful predictor of wine quality.
It was interesting to see some assumptions confirmed, such as higher alcohol content and citric acid yielding good quality wines. It was surprising to learn that some variables, such as sugar, do not really affect quality as much. Perhaps there is a lot of difference between what we know as fact and what we perceive.
Since there was little data on both good and bad wines and much more on average wines, it was hard to delineate a pattern from such limited information.
Also, the quality of these wines has been analyzed based on what can be measured quantitatively, but other factors such as color and smell, which are also huge determining factors in deciding quality, would be interesting to analyze.
https://s3.amazonaws.com/content.udacity-data.com/courses/ud651/diamondsExample_2016-05.html
https://discussions.udacity.com/t/ggpairs-function/287231/11
https://rpubs.com/jeknov/redwine
https://discussions.udacity.com/t/spoiler-code-need-helping-knitting-my-rmd-file-into-an-html-doc/294489